Benchmark multi-column GROUP BY performance#22322
Conversation
|
|
||
| fn generate_parquet_file(num_cols: usize, cardinality: usize) -> NamedTempFile { | ||
| let mut rng = StdRng::seed_from_u64(42); | ||
| let fields: Vec<Field> = (0..num_cols) |
There was a problem hiding this comment.
Right now the benchmark only generates Int32 grouping columns, so DataFusion will always pick the vectorized GroupValuesColumn path for these multi-column GROUP BYs.
The PR description talks about measuring when the default per-column implementation starts to lose against the row-based GroupValuesRows implementation, but this benchmark never exercises the row-based path, so it cannot really validate the claimed crossover point.
Could we add a second benchmark variant that explicitly uses GroupValuesRows, or benchmark the two GroupValues implementations directly? That would make it possible to compare both implementations across the same cardinality and column-count combinations.
There was a problem hiding this comment.
but this benchmark never exercises the row-based path, so it cannot really validate the claimed crossover point.
Yes, the way I benchmarked is by making code change to enforce one or the other and run them separately.
There was a problem hiding this comment.
Addressed. I added a new benchmar that directly instantiates both GroupValuesColumn (vectorized) and GroupValuesRows (row-based) and calls their intern() method with identical Int32 data
| } | ||
|
|
||
| #[expect(clippy::needless_pass_by_value)] | ||
| fn query(ctx: Arc<Mutex<SessionContext>>, rt: &Runtime, sql: &str) { |
There was a problem hiding this comment.
It looks like each Criterion iteration calls ctx.sql(sql) before df.collect(), which means planning and logical/physical optimization are currently included in the measured path.
The module docs say the query is pre-planned and that the benchmark is only measuring execution, so this seems a bit inconsistent. It also makes the column-count experiment harder to interpret because planning cost grows with the generated projection and grouping expressions.
Could we construct the plan once during setup and only time repeated execution of that plan? Otherwise it would help to update the benchmark naming/docs to make it clear that planning is intentionally included.
There was a problem hiding this comment.
Instead of pre-planning a physical plan (which can't be re-executed due to internal state), the new benchmark bypasses SQL/planning entirely by calling GroupValues::intern() directly on pre-generated in-memory batches.
| b.iter(|| query(b_ctx.ctx.clone(), &rt, &build_group_by_sql(4))) | ||
| }); | ||
|
|
||
| let b_ctx = prepare_context(&rt, 4, 30); // 30^4 = 810K groups |
There was a problem hiding this comment.
Some of the benchmark names/comments currently report the theoretical key-space size rather than the number of distinct groups actually produced.
With around 1M input rows, the 30^4 = 810K case only produces roughly ~574K observed groups in expectation, and the 100^4 / 500^4 cases are effectively capped near the input row count instead of producing 100M or 62B distinct groups.
Since the PR is trying to identify a distinct-group-count threshold, these labels end up being a bit misleading. Could we either generate data with controlled exact NDV per case, or rename/report these as key-space cardinalities and include the measured distinct counts alongside them?
There was a problem hiding this comment.
Just addresed by using sequential enumeration (global_row % num_distinct_groups decomposed per-column) which guarantees the exact number of distinct groups matches the label. No random sampling
…g overhead, exact NDV Addresses all three review comments from @kosiew: 1. **Implementation comparison**: Benchmarks both GroupValuesColumn (vectorized, via Int32 columns) and GroupValuesRows (row-based, via FixedSizeBinary(4) columns that trigger the fallback path) side-by-side. 2. **Execution-only timing**: Pre-optimizes the logical plan once via `df.into_parts()`. Each benchmark iteration only does physical planning + execution, excluding SQL parsing and logical optimization. 3. **Exact cardinality**: Replaces random sampling with sequential enumeration (`global_row % num_distinct_groups` decomposed per-column), guaranteeing precise distinct group counts with no birthday-paradox error. Additionally motivated by apache#17850, adds comprehensive experiments: - Issue apache#17850 regression reproduction (3 cols, 64 groups, 1M-50M rows) - Low cardinality sweep (8-4096 groups) - Batch size sensitivity (1K-32K) - Column count scaling (2-10 cols, low and high cardinality) - Group count sweep (16 to 1M groups) - Random vs sequential data patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a fair apples-to-apples benchmark that directly calls GroupValues::intern() with identical Int32 data for both GroupValuesColumn (vectorized) and GroupValuesRows (row-based). This eliminates the previous confounding factors (different data types, SQL/planning overhead) and confirms the regression reported in apache#17850: row-based is 16-19% faster at low cardinality (64 groups), with a crossover at ~200K-500K groups where vectorized becomes faster. Experiments: - Issue apache#17850 reproduction (3 cols, 64 groups, 1M-50M rows) - Low cardinality sweep (8-4096 groups) - Batch size sensitivity (1K-32K) - Column count scaling (2-10 cols) - High cardinality scaling (1M groups) - Group count sweep (16 to 1M groups) - Random vs sequential data patterns
| pub mod multi_group_by; | ||
|
|
||
| mod row; | ||
| pub mod row; |
There was a problem hiding this comment.
Required for the benchmark to directly instantiate GroupValuesRows with the same Int32 schema used by GroupValuesColumn. Without this, benchmarks can only trigger the row-based path via an unsupported type, which makes the comparison unfair.
|
Thanks for the review @kosiew. Changes from your comments:
One thing im not sure about: making |
There was a problem hiding this comment.
@nathanb9
Thanks for the updates here. The benchmark changes address the earlier concerns around comparing GroupValuesColumn vs GroupValuesRows directly and removing SQL planning overhead from the measured path.
I still found one issue with the 1M-group benchmark cases. Because generate_batches truncates to whole batches, the generated input for the advertised 1_000_000 row/group cases is actually smaller than the benchmark labels suggest.
| .powf(1.0 / num_cols as f64) | ||
| .ceil() as usize; | ||
|
|
||
| let num_batches = num_rows / batch_size; |
There was a problem hiding this comment.
generate_batches currently computes num_batches = num_rows / batch_size, which drops the remainder rows.
With the default 8192 batch size and num_rows = 1_000_000, the benchmark generates only 122 * 8192 = 999_424 rows. That means the grp_1M / cols_*_grp_1M cases can produce at most 999_424 observed groups, not 1,000,000.
So while the sequential enumeration change fixes most of the earlier cardinality-label mismatch, the exact 1M threshold cases are still slightly off.
Could we either generate the final partial batch, or adjust/report the benchmark labels based on the actual generated row/group counts?
There was a problem hiding this comment.
fixed now I get the remainder as well
| pub mod multi_group_by; | ||
|
|
||
| mod row; | ||
| pub mod row; |
There was a problem hiding this comment.
On the question about pub mod row: if the public access is only needed for the benchmark, I think it would be cleaner to keep mod row; private and expose only the narrow surface that is needed, for example pub use row::GroupValuesRows;.
That keeps the internal module layout private instead of publishing the full module path just for benchmark access.
There was a problem hiding this comment.
Sounds good. pub use row::GroupValuesRows;
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@kosiew Thanks for your thorough review |
Which issue does this PR close?
Rationale for this change
GroupValuesColumn) underperforms the row-based approach (GroupValuesRows) when distinct group count is small relativeto input rows. The benchmark confirms this: row-based is 16-19% faster below ~200K groups, while vectorized wins by 15-33% above ~500K groups.
What changes are included in this PR?
Adds a benchmark in
datafusion/physical-plan/benches/multi_group_by.rsthat directly callsGroupValues::intern()with identical Int32 data for both implementations —no SQL/planning/IO overhead, same schema, same hashing.
Makes
mod rowpublic so the benchmark can instantiateGroupValuesRowsdirectly.Test cases:
Are these changes tested?
cargo fmt --allcargo clippy -p datafusion-physical-plan --bench multi_group_by -- -D warningscargo bench -p datafusion-physical-plan --bench multi_group_byAre there any user-facing changes?
No. This adds a benchmark only.